Smart Spend: A Data-Driven Framework for Google Ads

MKTG - LEE

Executive Summary

  • This project analyzes a simulated Kaggle dataset designed to mimic Google Ads performance. Because the data is artificial and does not reflect real-world user behavior, our goal is not to create production-ready predictions but to demonstrate a rigorous, interpretable, and scalable analytics framework for ad optimization.

  • Starting from 2,600 raw rows and 13 columns, we cleaned and engineered key marketing metrics (CPC, CPA, CVR, CTR, ROAS), added temporal features, and applied 7-day rolling averages to strengthen signal stability. Exploratory analysis revealed wide ROAS variability and strong relationships between cost, impressions, and revenue.

  • Using K-means segmentation (k=4), ROAS-based targeting, RFM scoring, ROMI calculation, and Random Forest modeling, we identified clear performance patterns and reliable levers for optimization. Although the data is synthetic, the analytical workflow and insights mirror real-world practices and provide a transferable blueprint for budget allocation and strategic targeting.

Introduction & Data Set Used

  • We chose this Google Ads dataset to better understand how to optimize ad spend and budget allocation by increasing return on ad spend (ROAS), using insights from campaign, keyword, device, location, and conversion data.

  • The data set we analyze comes from Kaggle.

Data Preprocessing and Exploratory Data Analysis

  • We explore the data set by visualizing the main variables of interest.
# Load the Libraries necessary for the following analysis
## Core Libraries
import polars as pl
import pandas as pd
import numpy as np
import math
from pathlib import Path
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

## Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

## SciPy Optimization
from scipy.optimize import minimize

## Scikit-Learn
from sklearn.cluster import KMeans
from sklearn.preprocessing import (StandardScaler, LabelEncoder, OneHotEncoder)
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import (LinearRegression, LogisticRegression)
from sklearn.metrics import (confusion_matrix, classification_report, accuracy_score,
                             mean_absolute_error, mean_squared_error, r2_score)
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
  • Data Cleansing
# Data cleansing was completed separately; load the cleaned file below
file_path = r'/workspaces/mktg_codespace/Final Project/Google_Ad_Cleaned (1).csv'
df = pl.read_csv(file_path)
print(df.head())
print("Shape:", df.shape)
shape: (5, 25)
┌───────┬────────┬─────────────┬────────┬───┬──────┬───────┬──────┬──────┐
│ Ad_ID ┆ Clicks ┆ Impressions ┆ Cost   ┆ … ┆ CVR  ┆ CPA   ┆ ROAS ┆ CTR  │
│ ---   ┆ ---    ┆ ---         ┆ ---    ┆   ┆ ---  ┆ ---   ┆ ---  ┆ ---  │
│ str   ┆ f64    ┆ f64         ┆ f64    ┆   ┆ f64  ┆ f64   ┆ f64  ┆ f64  │
╞═══════╪════════╪═════════════╪════════╪═══╪══════╪═══════╪══════╪══════╡
│ A1000 ┆ 104.0  ┆ 4498.0      ┆ 231.88 ┆ … ┆ 0.07 ┆ 33.13 ┆ 8.16 ┆ 0.02 │
│ A1001 ┆ 173.0  ┆ 5107.0      ┆ 216.84 ┆ … ┆ 0.05 ┆ 27.11 ┆ 7.74 ┆ 0.03 │
│ A1002 ┆ 90.0   ┆ 4544.0      ┆ 203.66 ┆ … ┆ 0.1  ┆ 22.63 ┆ 7.97 ┆ 0.02 │
│ A1003 ┆ 142.0  ┆ 3185.0      ┆ 237.66 ┆ … ┆ 0.04 ┆ 39.61 ┆ 5.15 ┆ 0.04 │
│ A1004 ┆ 156.0  ┆ 3361.0      ┆ 195.9  ┆ … ┆ 0.05 ┆ 24.49 ┆ 5.57 ┆ 0.05 │
└───────┴────────┴─────────────┴────────┴───┴──────┴───────┴──────┴──────┘
Shape: (2345, 25)
  • Exploratory Data Analysis
# Summary Statistics
print('\n'+'='*80)
print('Descriptive Statistics')
print('='*80)

print('\n【Numerical Summary Statistics】')
num_cols = ["Clicks", "Impressions", "Cost", "Leads", "Conversions"]
summary = df.select(num_cols).describe()
print(summary)

================================================================================
Descriptive Statistics
================================================================================

【Numerical Summary Statistics】
shape: (9, 6)
┌────────────┬────────────┬─────────────┬────────────┬───────────┬─────────────┐
│ statistic  ┆ Clicks     ┆ Impressions ┆ Cost       ┆ Leads     ┆ Conversions │
│ ---        ┆ ---        ┆ ---         ┆ ---        ┆ ---       ┆ ---         │
│ str        ┆ f64        ┆ f64         ┆ f64        ┆ f64       ┆ f64         │
╞════════════╪════════════╪═════════════╪════════════╪═══════════╪═════════════╡
│ count      ┆ 2345.0     ┆ 2345.0      ┆ 2345.0     ┆ 2307.0    ┆ 2345.0      │
│ null_count ┆ 0.0        ┆ 0.0         ┆ 0.0        ┆ 38.0      ┆ 0.0         │
│ mean       ┆ 138.561194 ┆ 4528.773561 ┆ 215.066665 ┆ 19.988296 ┆ 6.516844    │
│ std        ┆ 33.890429  ┆ 862.575657  ┆ 22.074333  ┆ 6.031095  ┆ 2.270299    │
│ min        ┆ 80.0       ┆ 3000.0      ┆ 124.39     ┆ 10.0      ┆ 3.0         │
│ 25%        ┆ 110.0      ┆ 3781.0      ┆ 197.03     ┆ 15.0      ┆ 5.0         │
│ 50%        ┆ 139.0      ┆ 4525.0      ┆ 215.41     ┆ 20.0      ┆ 7.0         │
│ 75%        ┆ 167.0      ┆ 5275.0      ┆ 233.25     ┆ 25.0      ┆ 9.0         │
│ max        ┆ 199.0      ┆ 5997.0      ┆ 304.74     ┆ 30.0      ┆ 10.0        │
└────────────┴────────────┴─────────────┴────────────┴───────────┴─────────────┘

Calculate Critical Business Metrics

print('\n【Critical Business Metrics】')

total_cost = df.select(pl.sum("Cost")).item()
total_sales = df.select(pl.sum("Sale_Amount")).item()
total_conversions = df.select(pl.sum("Conversions")).item()
total_clicks = df.select(pl.sum("Clicks")).item()

def safe_item(expr_result):
    if expr_result is None:
        return 0
    val = expr_result.item()
    if val is None or (isinstance(val,float) and math.isnan(val)):
        return 0
    return val

roi = ((total_sales - total_cost) / total_cost) * 100
avg_roas = safe_item(df.select(pl.mean("ROAS")))
avg_cpa = safe_item(df.select(pl.mean("CPA")))
avg_cvr = safe_item(df.select(pl.mean("CVR")))
avg_cpc = safe_item(df.select(pl.mean("CPC")))

print(f'''
Total Performance:
--- Total Cost: ${total_cost:,.2f}
--- Total Sales: ${total_sales:,.2f}
--- Total Conversions: {total_conversions:,.0f}
--- Total Clicks: {total_clicks:,.0f}

Benefit Indicators:
--- ROI: {roi:.1f}%
--- Average ROAS: {avg_roas:.2f}x
--- Average CPA: ${avg_cpa:.2f}
--- Average CVR: {avg_cvr * 100:.2f}%
--- Average CPC: ${avg_cpc:.2f}
''')

【Critical Business Metrics】

Total Performance:
--- Total Cost: $504,331.33
--- Total Sales: $3,512,967.00
--- Total Conversions: 15,282
--- Total Clicks: 324,926

Benefit Indicators:
--- ROI: 596.6%
--- Average ROAS: 7.04x
--- Average CPA: $38.16
--- Average CVR: 5.04%
--- Average CPC: $1.65

Device Analysis

print('\n【Performance Analysis of Each Device】')

device_analysis = (
    df.group_by("Device")
    .agg([
        pl.count("Ad_ID").alias("Ad_count"),
        pl.sum("Cost").alias("Total_cost"),
        pl.sum("Sale_Amount").alias("Total_sale"),
        pl.sum("Conversions").alias("Total_conversions"),
        pl.mean("ROAS").alias("Avg_ROAS"),
        pl.mean("CPA").alias("Avg_CPA"),
        pl.mean("CVR").alias("Avg_CVR")
    ])
    .with_columns([
        ((pl.col("Total_sale") - pl.col("Total_cost")) / pl.col("Total_cost") * 100)
        .round(1)
        .alias("ROI%")
    ])
)

print(device_analysis)

best_device_row = device_analysis.sort("ROI%", descending=True).head(1)
best_device = best_device_row["Device"][0]
best_roi = best_device_row["ROI%"][0]

print(f'\nBest Device: {best_device} (ROI: {best_roi:.1f}%)')

【Performance Analysis of Each Device】
shape: (3, 9)
┌─────────┬──────────┬────────────┬────────────┬───┬──────────┬───────────┬──────────┬───────┐
│ Device  ┆ Ad_count ┆ Total_cost ┆ Total_sale ┆ … ┆ Avg_ROAS ┆ Avg_CPA   ┆ Avg_CVR  ┆ ROI%  │
│ ---     ┆ ---      ┆ ---        ┆ ---        ┆   ┆ ---      ┆ ---       ┆ ---      ┆ ---   │
│ str     ┆ u32      ┆ f64        ┆ f64        ┆   ┆ f64      ┆ f64       ┆ f64      ┆ f64   │
╞═════════╪══════════╪════════════╪════════════╪═══╪══════════╪═══════════╪══════════╪═══════╡
│ Mobile  ┆ 792      ┆ 171201.68  ┆ 1.185958e6 ┆ … ┆ 6.998699 ┆ 38.020278 ┆ 0.051111 ┆ 592.7 │
│ Tablet  ┆ 754      ┆ 161796.47  ┆ 1.129394e6 ┆ … ┆ 7.055027 ┆ 38.382878 ┆ 0.049257 ┆ 598.0 │
│ Desktop ┆ 799      ┆ 171333.18  ┆ 1.197615e6 ┆ … ┆ 7.06766  ┆ 38.090213 ┆ 0.050751 ┆ 599.0 │
└─────────┴──────────┴────────────┴────────────┴───┴──────────┴───────────┴──────────┴───────┘

Best Device: Desktop (ROI: 599.0%)

Visualizations

print('\n'+'='*80)
print('Visualization')
print('\n'+'='*80)

# ROAS Distribution
roas_data = df.select('ROAS').drop_nulls().to_series().to_list()
plt.figure(figsize=(9,6))
plt.hist(roas_data, bins=30, color='skyblue',edgecolor='black',alpha=0.7)
plt.axvline(df.select('ROAS').mean().item(), color='red',linestyle='--',linewidth=2,
label=f"Avg: {df.select('ROAS').mean().item():.2f}x")
plt.axvline(df.select('ROAS').median().item(),color='green',linestyle='--',linewidth=2,
label=f"median: {df.select('ROAS').median().item():.2f}x")
plt.xlabel('ROAS',fontsize=12)
plt.ylabel('Frequency',fontsize=12)
plt.title('ROAS Distribution',fontsize=14,fontweight='bold')
plt.legend()
plt.grid(alpha=0.3)
plt.show()

# Average ROAS for each device
device_roas = (
    df.filter(pl.col('Device').is_not_null())
    .with_columns(
        pl.col('ROAS')
            .cast(pl.Float64)
            .fill_null(0)
    )
    .group_by('Device')
    .agg(pl.col('ROAS').mean().alias('Avg_ROAS'))
    .filter(pl.col('Avg_ROAS').is_not_null())
    .sort('Avg_ROAS',descending=True)
)

devices = device_roas['Device'].to_list()
roas_values = device_roas['Avg_ROAS'].to_list()

colors = plt.cm.Set2.colors[:len(devices)]

plt.figure(figsize=(8, 6))
bars = plt.bar(devices, roas_values, color=colors, edgecolor='black', linewidth=1.5)
plt.ylabel('Avg. ROAS', fontsize=12)
plt.xlabel('Device', fontsize=12)
plt.title('Average ROAS per Device', fontsize=14, fontweight='bold')

for i, val in enumerate(roas_values):
    if val is not None:
        plt.text(i, val + 0.05, f"{val:.4f}x", ha='center', fontsize=11, fontweight='bold')

plt.grid(axis='y', alpha=0.3)
plt.show()

# Total Sales of each device
device_sales = df.group_by('Device').agg([
    pl.col('Sale_Amount').sum().alias('Total_sales')
]).sort('Total_sales',descending=True)

devices = device_sales['Device'].to_list()
sales_values = [x/1000 for x in device_sales['Total_sales'].to_list()]

colors = plt.cm.Paired.colors[:len(devices)]

plt.figure(figsize=(9,6))
bars = plt.bar(devices,sales_values,color=colors,edgecolor='black',linewidth=1.5)

plt.xlabel('Device',fontsize=12)
plt.ylabel('Total Sales ($K)',fontsize=12)
plt.title('Total Sales by Device',fontsize=14,fontweight='bold')
plt.grid(alpha=0.3,axis='y')

for bar,value in zip(bars,sales_values):
    plt.text(bar.get_x()+bar.get_width()/2,bar.get_height(),f'{value:.1f}',
    ha='center',va='bottom',fontsize=10,color='black')

plt.show()

# ROAS vs Sale_Amount Scatterplot
df_scatter = df.select(['ROAS','Sale_Amount']).to_pandas()
plt.figure(figsize=(9,6))
plt.scatter(df_scatter['ROAS'],df_scatter['Sale_Amount'],alpha=0.5,c='purple',s=30,edgecolor='black',linewidth=0.5)
plt.xlabel('ROAS',fontsize=12)
plt.ylabel('Sale Amount ($)',fontsize=12)
plt.title('ROAS vs Sale Amount',fontsize=14,fontweight='bold')
plt.grid(alpha=0.3)
plt.show()

# Correlation Heatmap
key_cols = ['Cost','Conversions','Sale_Amount','ROAS','CPA','CVR']
corr_df = df.select(key_cols).to_pandas()
corr_matrix = corr_df.corr()

plt.figure(figsize=(8,6))
sns.heatmap(corr_matrix,annot=True,fmt='.2f',cmap='coolwarm',center=0,square=True,
            linewidths=2,cbar_kws={'shrink':0.8},annot_kws={'fontsize':10,'fontweight':'bold'})
plt.title('Correlation Heatmap',fontsize=14,fontweight='bold')
plt.show()

# Boxplot - ROAS Distribution of Each Device
df_box = df.select(['Device','ROAS']).to_pandas()
plt.figure(figsize=(9,6))
sns.boxplot(data=df_box,x='Device',y='ROAS',palette='Set2')
plt.xlabel('Device',fontsize=12)
plt.ylabel('ROAS',fontsize=12)
plt.title('ROAS Distribution Of Each Device',fontsize=14,fontweight='bold')
plt.grid(axis='y',alpha=0.3)

medians = df_box.groupby('Device')['ROAS'].median()
for i, (device,median) in enumerate(medians.items()):
    plt.text(
        i,
        median,
        f'{median:.4f}',
        ha='center',
        va='bottom',
        fontweight='bold',
        color='black',
        fontsize=10
    )

plt.show()

================================================================================
Visualization

================================================================================

#1 ROAS Distribution: The histogram shows a roughly normal distribution, with a mean of 7.04x and a median of 6.95x.

#2 Average ROAS per Device: The bar chart compares average ROAS across devices and shows negligible variation in performance between them.

#3 Total Sales by Device: The bar chart compares total sales across devices; Desktop leads slightly at $1,197.6K, closely followed by Mobile at $1,186.0K, while Tablet trails at $1,129.4K.

#4 ROAS vs Sale Amount: The scatter plot reveals a strong positive correlation between ROAS and Sale Amount, indicating that higher ad efficiency drives higher sales revenue.

#5 Correlation Heatmap: The heatmap highlights a strong positive correlation between ROAS and Sale Amount (0.87), contrasting with a strong negative correlation between Conversions and CPA (-0.91).

#6 ROAS Distribution of Each Device: The boxplot compares the ROAS distribution across devices, revealing that all three have nearly identical performance, with medians hovering around 6.95.

After completing EDA, we will proceed to conduct more advanced analyses based on the insights we discovered during the EDA process.

CUSTOMER SEGMENTATION

# Select the base variables for clustering and scale the data.
df_bases = df.select([
    "Clicks", 
    "Impressions", 
    "Cost", 
    "Conversions"
    ])
df_bases = df_bases.drop_nulls()

# Create K-Means Clustering Pipeline
def create_pipeline(num_clusters, random_seed = 42):
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('kmeans', KMeans(n_clusters=num_clusters, random_state=random_seed))
    ])
    return pipeline

# Determining Optimal Clusters with the Elbow Method
def calculate_totwithinss(data, k):
    kmeans_pipeline = create_pipeline(k, random_seed=10)
    kmeans_pipeline.fit(data)
    return kmeans_pipeline['kmeans'].inertia_

# Calculate tot.withinss for different values of k
k_values = range(1, 10)
totwithinss_values = [calculate_totwithinss(df_bases, k) for k in k_values]

# Create a DataFrame for results
kmeans_results = pl.DataFrame(
    {'num_clusters': k_values,
     'tot_withinss': totwithinss_values}) # df for elbow plot

# Plot the elbow method using Plotly Express
elbow_plot = px.line(
    data_frame = kmeans_results,
    x = 'num_clusters',
    y = 'tot_withinss', 
    markers = True,
    labels = {
        'num_clusters': 'Number of Clusters', 'tot_withinss': 'Total Within SS'
        },
    title = 'Elbow Method for Optimal k')

elbow_plot.show()

# Choose the number of clusters based on the elbow method
optimal_k = 4

# Run K-means clustering

campaign_kmeans_pipeline = create_pipeline(optimal_k)
campaign_kmeans_pipeline.fit(df_bases)

# Add cluster assignments to the original data
labels = campaign_kmeans_pipeline['kmeans'].labels_ + 1  # make 1-indexed

df_with_clusters = df_bases.with_columns(
    pl.Series("segment_number", labels.astype(str))  # ensure same length & dtype
)

df_with_clusters.head()

# Segment Description
# Calculate summary statistics for each segment
segment_summary = df_with_clusters.group_by('segment_number').agg(
    [
        pl.mean('Clicks').alias('Mean_Clicks'),
        pl.mean('Impressions').alias('Mean_Impressions'),
        pl.mean('Cost').alias('Mean_Cost'),
        pl.mean('Conversions').alias('Mean_Conversions'),
        pl.len().alias('n')
    ]
)

segment_summary
shape: (4, 6)
┌────────────────┬─────────────┬──────────────────┬────────────┬──────────────────┬─────┐
│ segment_number ┆ Mean_Clicks ┆ Mean_Impressions ┆ Mean_Cost  ┆ Mean_Conversions ┆ n   │
│ ---            ┆ ---         ┆ ---              ┆ ---        ┆ ---              ┆ --- │
│ str            ┆ f64         ┆ f64              ┆ f64        ┆ f64              ┆ u32 │
╞════════════════╪═════════════╪══════════════════╪════════════╪══════════════════╪═════╡
│ "2"            ┆ 103.809783  ┆ 4321.668478      ┆ 226.394076 ┆ 7.425725         ┆ 552 │
│ "4"            ┆ 161.732759  ┆ 3638.093103      ┆ 211.072397 ┆ 6.744828         ┆ 580 │
│ "3"            ┆ 163.690598  ┆ 5134.8           ┆ 231.602171 ┆ 6.302564         ┆ 585 │
│ "1"            ┆ 124.297771  ┆ 4968.886943      ┆ 193.395764 ┆ 5.707006         ┆ 628 │
└────────────────┴─────────────┴──────────────────┴────────────┴──────────────────┴─────┘

Customer Segmentation Conclusion:

Segment 1 - Cost-efficient segment with steady conversions. - Indicates effective targeting but limited scale.

Segment 2 - Highest conversions despite higher costs. - Strong, high-intent audience. Suitable for selective investment and optimization.

Segment 3 - Largest reach (highest impressions and clicks) but only average conversions. - Represents the biggest improvement opportunity through better targeting or creative refinement.

Segment 4 - Strong conversion efficiency with the lowest impressions. - Performs well in a small audience; an ideal candidate for controlled scaling tests.
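One way to sanity-check the elbow choice of k = 4 is a silhouette comparison. This is a minimal sketch on synthetic data whose four base variables roughly match the means and spreads reported in the summary statistics above; because the stand-in data is unstructured noise, the absolute scores will be low, but the same loop applied to `df_bases` would show which k separates the real ads best:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
# Synthetic stand-in for df_bases: (Clicks, Impressions, Cost, Conversions)
X = rng.normal(loc=[138, 4529, 215, 6.5], scale=[34, 863, 22, 2.3], size=(500, 4))
X_scaled = StandardScaler().fit_transform(X)

# Compare silhouette scores across candidate k values (higher = better separation)
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_scaled)
    print(f"k={k}: silhouette={silhouette_score(X_scaled, labels):.3f}")
```

The silhouette score complements the elbow plot: the elbow looks at within-cluster compactness alone, while the silhouette also penalizes overlap between clusters.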

TARGETING ANALYSIS

# Set random seed for reproducibility
np.random.seed(10)

# Create outcome variable: ROAS_segment & define thresholds based on quantiles 

# Ensure ROAS exists (safe: NA when Cost <= 0 or null)
df = df.with_columns(
    pl.when((pl.col("Cost") > 0) & pl.col("Sale_Amount").is_not_null())
      .then(pl.col("Sale_Amount") / pl.col("Cost"))
      .otherwise(None)
      .alias("ROAS")
)

# keep only rows where we can label the target
df_lbl = df.filter(pl.col("ROAS").is_not_null())


# Get 33rd and 66th percentile cutoffs as Python floats
q = df.select([
    pl.col("ROAS").quantile(0.33).alias("q33"),
    pl.col("ROAS").quantile(0.66).alias("q66")
]).row(0)
q33, q66 = q[0], q[1]

# Create categorical target with literals
df = df.with_columns(
    pl.when(pl.col("ROAS").is_null()).then(pl.lit(None))              # keep null ROAS unlabeled
     .when(pl.col("ROAS") >= pl.lit(q66)).then(pl.lit("High_ROAS"))   # top tercile (>= q66)
     .when(pl.col("ROAS") >= pl.lit(q33)).then(pl.lit("Medium_ROAS")) # middle tercile (>= q33)
     .otherwise(pl.lit("Low_ROAS"))                                   # bottom tercile
     .alias("ROAS_segment")
)

# Quick check
print(df.select("ROAS_segment").to_pandas().value_counts())

# Convert to Pandas DataFrame for easier handling with sklearn
df_targeting = df.to_pandas()

# Encode the target variable using LabelEncoder
label_encoder = LabelEncoder()

df_targeting['ROAS_segment'] = label_encoder.fit_transform(df_targeting['ROAS_segment'])

# Define features and target
features = ["Device", "Clicks", "Impressions", "Cost", "Conversions"]
target = "ROAS_segment"

# Split data into train, test and predict sets
old, new = train_test_split(df_targeting, test_size = 0.3)
train, test = train_test_split(old, test_size = 0.3, random_state=101)

# Define column transformer for one-hot encoding categorical features
categorical_features = ["Device",]
numeric_features = ["Clicks", "Impressions", "Cost", "Conversions"]

preprocessor = ColumnTransformer([
    ('num', 'passthrough', numeric_features),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
])

# Train and Evaluate One Model
logit_model = LogisticRegression(max_iter=1000, solver='liblinear')

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', logit_model)
])

# Fit the pipeline on training data
pipeline.fit(train[features], train[target])

# Now, ready to predict on the test set

test = test.copy()  # avoid SettingWithCopyWarning when adding the prediction column
test['pred_seg'] = pipeline.predict(test[features])

# Create confusion matrix and compute accuracy
conf_matrix = confusion_matrix(test[target], test['pred_seg'])
accuracy = accuracy_score(test[target], test['pred_seg'])

# Print results
print(f"\nModel: Logistic regression")
print("Confusion Matrix:\n", conf_matrix)
print("Accuracy:", accuracy)
ROAS_segment
High_ROAS       798
Low_ROAS        774
Medium_ROAS     773
Name: count, dtype: int64

Model: Logistic regression
Confusion Matrix:
 [[79 64 23]
 [49 81 27]
 [84 63 23]]
Accuracy: 0.3711967545638945
  • Use Trained Model to Predict Future Customers
# Load the prospective customer data
prospect_df = new.copy()

# Use the same preprocessing pipeline
# Only select features that match the training data
prospect_features = prospect_df[features]
  • Now, use the trained model to make prediction
# Make predictions on the unseen data
predicted = pipeline.predict(prospect_features) 

# Convert Pandas -> Polars (since you want to use Polars APIs below)
prospect_pl = pl.from_pandas(prospect_df)

# Decode numeric predictions back to string labels using the SAME label_encoder
classes = label_encoder.classes_               
pred_labels = pd.Series(predicted).map(dict(enumerate(classes))).astype(str)

# Add both the numeric code and the readable label
prospect_pl = prospect_pl.with_columns(
    pl.Series("predicted_segment_code", predicted),  
    pl.Series("predicted_segment", pred_labels)       
)

# Display some of the predictions
print(prospect_pl.select("predicted_segment").head(15))
shape: (15, 1)
┌───────────────────┐
│ predicted_segment │
│ ---               │
│ str               │
╞═══════════════════╡
│ High_ROAS         │
│ Low_ROAS          │
│ High_ROAS         │
│ High_ROAS         │
│ High_ROAS         │
│ …                 │
│ High_ROAS         │
│ High_ROAS         │
│ Low_ROAS          │
│ Low_ROAS          │
│ High_ROAS         │
└───────────────────┘

Analyzing Profitability: Sale_Amount by Segment

# Analyze which segment is most profitable
profitability = (
    df
    .group_by('ROAS_segment')
    .agg([
        pl.col('Sale_Amount').mean().alias('avg_Sale_Amount'),
        pl.len().alias('num_customers')
        ])
    .sort('avg_Sale_Amount')
)

profitability
shape: (3, 3)
┌───────────────┬─────────────────┬───────────────┐
│ ROAS_segment  ┆ avg_Sale_Amount ┆ num_customers │
│ ---           ┆ ---             ┆ ---           │
│ str           ┆ f64             ┆ u32           │
╞═══════════════╪═════════════════╪═══════════════╡
│ "Low_ROAS"    ┆ 1188.479328     ┆ 774           │
│ "Medium_ROAS" ┆ 1509.054334     ┆ 773           │
│ "High_ROAS"   ┆ 1787.700501     ┆ 798           │
└───────────────┴─────────────────┴───────────────┘
  • Sale_Amount Estimation
# Convert the Pandas prospective customer dataframe to Polars 
prospect_pl = pl.from_pandas(prospect_df)

# Convert the numeric predictions (0/1/2) back to their label strings
classes = label_encoder.classes_   # e.g., array(['High_ROAS', 'Low_ROAS', 'Medium_ROAS'])
pred_labels = pd.Series(predicted).map(dict(enumerate(classes)))

# Add the decoded labels back to the prospective customer table
prospect_pl = prospect_pl.with_columns(
    pl.Series("predicted_segment", pred_labels.astype(str))
)

# Ensure the right table’s join key is also of string type
profitability = profitability.with_columns(
    pl.col("ROAS_segment").cast(pl.Utf8)
)

# Join both tables on the segment label, fill missing values with 0, sum up the average Sale_Amount, and extract the single scalar value
Sale_Amount_estimate = (
    prospect_pl
    .join(
        profitability,
        left_on="predicted_segment",   # String label
        right_on="ROAS_segment",       # String label
        how="left"
    )
    .select(pl.col("avg_Sale_Amount").fill_null(0))
    .sum()
    .item()   # Extract scalar
)

print(Sale_Amount_estimate)
print(f"Estimated total Sale_Amount = {Sale_Amount_estimate:,.2f}")
1052051.8420157994
Estimated total Sale_Amount = 1,052,051.84
  • Compare true amount vs model-estimated amount
# 1) TRUE total Sale_Amount from prospect data
true_total_sale = (
    prospect_pl
    .select(pl.col("Sale_Amount").fill_null(0))
    .sum()
    .item()
)

print(f"True total Sale_Amount from prospect data = {true_total_sale:,.2f}")

# 2) Compare true amount vs model-estimated amount
print("\n===== Comparison =====")
print(f"Model Estimated Sale_Amount : {Sale_Amount_estimate:,.2f}")
print(f"Actual Sale_Amount         : {true_total_sale:,.2f}")

# 3) Optional: calculate difference & percentage error

difference = Sale_Amount_estimate - true_total_sale
pct_error = difference / true_total_sale * 100 if true_total_sale != 0 else None

print("\n===== Error Metrics =====")
print(f"Difference (Estimate - Actual): {difference:,.2f}")
print(f"Percentage Error: {pct_error:.2f}%")
True total Sale_Amount from prospect data = 1,052,661.00

===== Comparison =====
Model Estimated Sale_Amount : 1,052,051.84
Actual Sale_Amount         : 1,052,661.00

===== Error Metrics =====
Difference (Estimate - Actual): -609.16
Percentage Error: -0.06%

Targeting Conclusion:

  • What we did: The original dataset was split into three subsets (train + test = old, prospect = new) to simulate a real-world targeting workflow. The outcome variable, ROAS_segment, was created by classifying Return on Ad Spend (ROAS) into three categories based on quantile thresholds: High_ROAS, Medium_ROAS, and Low_ROAS. After encoding these labels and preparing the features (Device, Clicks, Impressions, Cost, and Conversions), a logistic regression model was trained on the training data and evaluated on the test set. The trained model then predicted ROAS_segment for the prospective customer subset, and profit was analyzed by segment.

  • What we got: The model achieved 37% accuracy, only slightly above random guessing for a 3-class classification problem (baseline accuracy ≈ 33.3%). This shows that the chosen features (Device, Clicks, Impressions, Cost, Conversions) do not strongly separate ROAS levels: cutting ROAS at quantiles creates artificial classes that do not reflect natural consumer behavior patterns. Even though the classification accuracy is low, the estimated total Sale_Amount is close to the actual total: estimated $1,052,051.84 vs. actual $1,052,661.00, an error of only –$609.16 (–0.06%). Three factors explain this:

  1. Segment average Sale_Amount values are very similar.
  2. Misclassifying many users across 3 segments creates a "cancelling-out" effect.
  3. The segmentation might not represent truly different customer behaviors.
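The "slightly above random" claim can be made concrete by pitting a classifier against a majority-class dummy baseline. This is a minimal sketch on synthetic, signal-free 3-class data (the feature matrix and labels below are illustrative stand-ins, not the ads dataset); with no real signal, both accuracies land near the ≈33% baseline:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(101)
# Synthetic stand-in: features with no class signal, three roughly equal segments
X = rng.normal(size=(1500, 5))
y = rng.integers(0, 3, size=1500)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

# Baseline: always predict the most frequent training class
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print(f"Dummy baseline accuracy : {accuracy_score(y_test, baseline.predict(X_test)):.3f}")
print(f"Logistic regression acc.: {accuracy_score(y_test, model.predict(X_test)):.3f}")
```

Reporting the dummy baseline alongside model accuracy makes it immediately clear how much (or how little) the features contribute beyond class frequencies.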

BCG MATRIX ANALYSIS

print('='*80)
print('BCG Matrix Analysis')
print('='*80)

if df["Ad_Date"].dtype == pl.Utf8:
    df = df.with_columns(
        pl.col("Ad_Date").str.to_date(format="%Y-%m-%d", strict=False)
    )

## Analysis date used: latest date + 1 day
analysis_date = df.select(pl.col('Ad_Date').max()).item() + timedelta(days=1)
print(f"Analysis Date: {analysis_date} is used in our analysis")

print('='*80)
print('Step1: Calculate BCG index')
print('='*80)
bcg = df.group_by('Ad_ID').agg([
    (pl.lit(analysis_date) - pl.col('Ad_Date').max()).dt.total_days().alias('Recency'),
    pl.col('Conversions').sum().alias('Frequency'),
    pl.col('Sale_Amount').sum().alias('Monetary'),
    pl.col('Cost').sum().alias('Total_Cost'),
    pl.col('Clicks').sum().alias('Total_Clicks'),
    pl.col('Impressions').sum().alias('Total_Impression'),
    pl.col('Leads').sum().alias('Total_Leads'),
    pl.col('CPC').mean().alias('Avg_CPC'),
    pl.col('CVR').mean().alias('Avg_CVR'),
    pl.col('CPA').mean().alias('Avg_CPA'),
    pl.col('ROAS').mean().alias('Avg_ROAS'),
    pl.col('CTR').mean().alias('Avg_CTR')
])
print(bcg.head())

## Find each variable's weight via PCA
X = bcg[[
 'Recency',
 'Frequency',
 'Monetary',
 'Total_Cost',
 'Total_Clicks',
 'Total_Impression',
 'Total_Leads',
 'Avg_CPC',
 'Avg_CVR',
 'Avg_CPA',
 'Avg_ROAS',
 'Avg_CTR']]

X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=1)
pca.fit(X_scaled)
weights = pl.DataFrame({
    'Feature': X.columns,
    'PCA_Weight': pca.components_[0]
})


print("PCA weights:")
print(weights)
## The result shows that CVR has the highest coefficient, CPC the second
## highest, and ROAS a very small coefficient

print('='*80)
print('Step2: Calculate Derived Indicators')
print('='*80)

bcg = bcg.with_columns([
    # ROI: (Monetary - Cost) / Cost *100%
    ((pl.col('Monetary') - pl.col('Total_Cost')) / pl.col('Total_Cost')*100).alias('ROI'),
    # Profit
    (pl.col('Monetary') - pl.col('Total_Cost')).alias('Profit'),
    # Efficiency Score for Ad Placement
    (pl.col('Avg_CVR') * 100 * 0.5 + (1/pl.col('Avg_CPC')) * 0.3 + pl.col('Avg_ROAS') * 0.2).alias('Efficiency_Score')
])
print('Complete Calculate ROI, Profit and Efficiency Score')

print('='*80)
print('Step3: Establish A Multi-Dimensional Scoring System')
print('='*80)


bcg = bcg.with_columns([
    # R_Score: Recency smaller = better
    pl.col('Recency').qcut(
        [0.2, 0.4, 0.6, 0.8],
        labels=['5','4','3','2','1'],
        allow_duplicates=True
    ).alias('R_Score'),

    # F_Score: Frequency bigger = better
    pl.col('Frequency').qcut(
        [0.2, 0.4, 0.6, 0.8],
        labels=['1','2','3','4','5'],
        allow_duplicates=True
    ).alias('F_Score'),

    # M_Score: Monetary bigger = better
    pl.col('Monetary').qcut(
        [0.2, 0.4, 0.6, 0.8],
        labels=['1','2','3','4','5'],
        allow_duplicates=True
    ).alias('M_Score'),

    # ROAS_Score: bigger = better
    pl.col('Avg_ROAS').qcut(
        [0.2, 0.4, 0.6, 0.8],
        labels=['1','2','3','4','5'],
        allow_duplicates=True
    ).alias('ROAS_Score'),

    # CVR_Score: bigger = better
    pl.col('Avg_CVR').qcut(
        [0.2, 0.4, 0.6, 0.8],
        labels=['1','2','3','4','5'],
        allow_duplicates=True
    ).alias('CVR_Score'),

    # CTR_Score: bigger = better
    pl.col('Avg_CTR').qcut(
        [0.2, 0.4, 0.6, 0.8],
        labels=['1','2','3','4','5'],
        allow_duplicates=True
    ).alias('CTR_Score'),

    # CPC_Score: smaller = better
    pl.col('Avg_CPC').qcut(
        [0.2, 0.4, 0.6, 0.8],
        labels=['5','4','3','2','1'],
        allow_duplicates=True
    ).alias('CPC_Score'),

    # CPA_Score: smaller = better
    pl.col('Avg_CPA').qcut(
        [0.2, 0.4, 0.6, 0.8],
        labels=['5','4','3','2','1'],
        allow_duplicates=True
    ).alias('CPA_Score'),

    # ROI_Score: bigger = better
    pl.col('ROI').qcut(
        [0.2, 0.4, 0.6, 0.8],
        labels=['1','2','3','4','5'],
        allow_duplicates=True
    ).alias('ROI_Score'),
])

print('Complete 9 dimension scoring')

# Confirming score columns are numeric ints 

score_columns = [
    "R_Score", "F_Score", "M_Score",
    "ROAS_Score", "CVR_Score", "CTR_Score",
    "CPC_Score", "CPA_Score", "ROI_Score"
]

bcg = bcg.with_columns([
    pl.col(col).cast(pl.Utf8).cast(pl.Int32).alias(col)
    for col in score_columns
])

# Recompute RFM_Total, Performance_Total, Overall_Score
bcg = bcg.with_columns([
    # RFM Overall Score
    (pl.col("R_Score") + pl.col("F_Score") + pl.col("M_Score")).alias("RFM_Total"),
    # Efficiency Overall Score
    (
        pl.col("ROAS_Score") + pl.col("CVR_Score") + pl.col("CTR_Score") +
        pl.col("CPC_Score")  + pl.col("CPA_Score") + pl.col("ROI_Score")
    ).alias("Performance_Total"),
])

bcg = bcg.with_columns(
    (pl.col("RFM_Total") * 0.6 + pl.col("Performance_Total") * 0.4)
    .alias("Overall_Score")
)

print('='*80)
print('Step4: Segmentation Based On BCG Dimensions')
print('='*80)

# Potential score

bcg = bcg.with_columns(
    (pl.col("R_Score") + pl.col("CVR_Score") + pl.col("F_Score"))
    .alias("Potential_Score")
)

# Dynamic thresholds

print('Calculate the dynamic clustering threshold:')

performance_q60 = bcg.select(pl.col("Overall_Score").quantile(0.6)).item()
performance_q40 = bcg.select(pl.col("Overall_Score").quantile(0.4)).item()

potential_q60 = bcg.select(pl.col("Potential_Score").quantile(0.6)).item()
potential_q40 = bcg.select(pl.col("Potential_Score").quantile(0.4)).item()

performance_median = bcg.select(pl.col("Overall_Score").median()).item()
potential_median = bcg.select(pl.col("Potential_Score").median()).item()

print("Score quantiles:")
print(f"  Performance 60%: {performance_q60:.2f}")
print(f"  Performance 40%: {performance_q40:.2f}")
print(f"  Potential 60%: {potential_q60:.2f}")
print(f"  Potential 40%: {potential_q40:.2f}")

# BCG segmentation

bcg = bcg.with_columns(
    pl.when(
        (pl.col("Overall_Score") >= performance_q60) &
        (pl.col("Potential_Score") >= potential_q60) &
        (pl.col("ROI") > 0)
    ).then(pl.lit("Stars"))
    .when(
        (pl.col("Overall_Score") >= performance_q60) &
        (pl.col("Potential_Score") < potential_q40)
    ).then(pl.lit("Cash Cows"))
    .when(
        (pl.col("Overall_Score") < performance_q40) &
        (pl.col("Potential_Score") >= potential_q60)
    ).then(pl.lit("Question Marks"))
    .when(
        (pl.col("Overall_Score") < performance_q40) &
        (pl.col("Potential_Score") < potential_q40)
    ).then(pl.lit("Dogs"))
    .otherwise(pl.lit("Middle Ground"))
    .alias("BCG_Segment")
)

print("\nDistribution of Each Group:")
segment_dist = bcg.group_by("BCG_Segment").count().sort("count", descending=True)
print(segment_dist)

## VISUALIZATIONS

print('='*80)
print("Step5: Visualization")
print('='*80)

print('#1: BCG Matrix')
bcg_pd = bcg.to_pandas()

plt.figure(figsize=(12, 10))

colors = {
    'Stars': '#FFD700',        
    'Cash Cows': '#32CD32',    
    'Question Marks': '#FFA500', 
    'Dogs': '#DC143C'           
}

for segment in ['Stars','Cash Cows','Question Marks','Dogs']:
    segment_data = bcg_pd[bcg_pd['BCG_Segment'] == segment]
    if len(segment_data) > 0:  
        plt.scatter(segment_data['Potential_Score'], 
                   segment_data['Overall_Score'],
                   label=f"{segment} (n={len(segment_data)})",
                   alpha=0.6, 
                   s=segment_data['Monetary']/10,  # Bubble Size = Monetary Size
                   c=colors[segment],
                   edgecolors='black',
                   linewidth=0.5)
# Adding the Median Line
plt.axhline(y=performance_median, color='gray', linestyle='--', linewidth=1.5, alpha=0.7, label='Performance Median')
plt.axvline(x=potential_median, color='gray', linestyle='--', linewidth=1.5, alpha=0.7, label='Potential Median')

# Quadrant Labels
y_range = bcg_pd['Overall_Score'].max() - bcg_pd['Overall_Score'].min()
x_range = bcg_pd['Potential_Score'].max() - bcg_pd['Potential_Score'].min()

plt.text(potential_median + x_range*0.05, performance_median + y_range*0.05, 
         'Stars\n(Invest & Grow)', 
         fontsize=11, alpha=0.4, fontweight='bold',
         bbox=dict(boxstyle='round', facecolor='gold', alpha=0.2))

plt.text(potential_median - x_range*0.25, performance_median + y_range*0.05, 
         'Cash Cows\n(Harvest)', 
         fontsize=11, alpha=0.4, fontweight='bold',
         bbox=dict(boxstyle='round', facecolor='green', alpha=0.2))

plt.text(potential_median + x_range*0.05, performance_median - y_range*0.25, 
         'Question Marks\n(Selective Invest)', 
         fontsize=11, alpha=0.4, fontweight='bold',
         bbox=dict(boxstyle='round', facecolor='orange', alpha=0.2))

plt.text(potential_median - x_range*0.25, performance_median - y_range*0.25, 
         'Dogs\n(Divest)', 
         fontsize=11, alpha=0.4, fontweight='bold',
         bbox=dict(boxstyle='round', facecolor='red', alpha=0.2))

plt.xlabel('Potential Score', fontsize=12, fontweight='bold')
plt.ylabel('Overall Score', fontsize=12, fontweight='bold')
plt.title('BCG Matrix\n(Bubble Size = Monetary)', fontsize=14, fontweight='bold', pad=20)
plt.legend(loc='best', fontsize=9, framealpha=0.9)
plt.grid(True, alpha=0.3, linestyle=':')
plt.tight_layout()
plt.show()

print('#2: Total Revenue and Cost Analysis for Each Quadrant')
segment_summary = bcg_pd.groupby('BCG_Segment').agg({
    'Monetary':'sum',
    'Total_Cost':'sum',
    'Profit':'sum'
}).reindex(['Stars','Cash Cows','Question Marks','Dogs'])

plt.figure(figsize=(12, 10))

# Revenue vs Cost
x = range(len(segment_summary))
width = 0.35

bars1 = plt.bar([i - width/2 for i in x], segment_summary['Monetary'], 
                width, label='Revenue', color='#4CAF50', edgecolor='black')
bars2 = plt.bar([i + width/2 for i in x], segment_summary['Total_Cost'], 
                width, label='Cost', color='#FF5722', edgecolor='black')

plt.xlabel('BCG Dimension', fontsize=12, fontweight='bold')
plt.ylabel('Amount ($)', fontsize=12, fontweight='bold')
plt.title('Revenue vs Total Cost', fontsize=14, fontweight='bold')
plt.xticks(ticks = x, labels=segment_summary.index, fontsize=10)
plt.legend(fontsize=10)
plt.grid(True, alpha=0.3, axis='y', linestyle=':')

# Adding Labels
for bar in bars1:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height,
            f'${height:,.0f}', ha='center', va='bottom', fontsize=9)
for bar in bars2:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height,
            f'${height:,.0f}', ha='center', va='bottom', fontsize=9)

plt.show()

print('#3: Radar Chart of Performance Indicators in Each Dimension')

# Calculate each dimension's efficiency(Standardize to 0~1)
metrics = ['R_Score', 'F_Score', 'M_Score', 'ROAS_Score', 'CVR_Score', 'ROI_Score']
segment_metrics = bcg_pd.groupby('BCG_Segment')[metrics].mean()
segment_metrics = segment_metrics.reindex(['Stars', 'Cash Cows', 'Question Marks', 'Dogs'])

# Scores are already on a 1-5 scale, so no further normalization is needed
segment_metrics_norm = segment_metrics

fig, ax = plt.subplots(figsize=(12, 12), subplot_kw=dict(projection='polar'))

# Setting angles
angles = np.linspace(0, 2 * np.pi, len(metrics), endpoint=False).tolist()
angles += angles[:1]  

# Draw each dimension
colors_radar = {
    'Stars': '#FFD700',
    'Cash Cows': '#32CD32',
    'Question Marks': '#FFA500',
    'Dogs': '#DC143C'
}

for segment in segment_metrics_norm.index:
    values = segment_metrics_norm.loc[segment].tolist()
    values += values[:1] 
    
    ax.plot(angles, values, 'o-', linewidth=2.5, 
            label=segment, color=colors_radar[segment])
    ax.fill(angles, values, alpha=0.15, color=colors_radar[segment])

# Adding Labels and Title
ax.set_xticks(angles[:-1])
ax.set_xticklabels(['Recency', 'Frequency', 'Monetary', 
                    'ROAS', 'CVR', 'ROI'], 
                   fontsize=11)
ax.set_ylim(0, 5)
ax.set_yticks([1, 2, 3, 4, 5])
ax.set_yticklabels(['1', '2', '3', '4', '5'], fontsize=9)
ax.grid(True, linestyle=':', alpha=0.7)

ax.legend(loc='upper right', bbox_to_anchor=(1.3, 1.1), fontsize=11, framealpha=0.95)
plt.title('Radar Chart for Each BCG Dimension (Quintile Scores: 5 = Best, 1 = Worst)',
          fontsize=16, fontweight='bold', pad=30)
plt.show()
================================================================================
BCG Matrix Analysis
================================================================================
Analysis Date: 2024-12-12
================================================================================
Step1: Calculate BCG index
================================================================================
shape: (5, 13)
┌───────┬─────────┬───────────┬──────────┬───┬─────────┬─────────┬──────────┬─────────┐
│ Ad_ID ┆ Recency ┆ Frequency ┆ Monetary ┆ … ┆ Avg_CVR ┆ Avg_CPA ┆ Avg_ROAS ┆ Avg_CTR │
│ ---   ┆ ---     ┆ ---       ┆ ---      ┆   ┆ ---     ┆ ---     ┆ ---      ┆ ---     │
│ str   ┆ i64     ┆ f64       ┆ f64      ┆   ┆ f64     ┆ f64     ┆ f64      ┆ f64     │
╞═══════╪═════════╪═══════════╪══════════╪═══╪═════════╪═════════╪══════════╪═════════╡
│ A1935 ┆ 23      ┆ 4.0       ┆ 1937.0   ┆ … ┆ 0.03    ┆ 60.54   ┆ 7.999504 ┆ 0.04    │
│ A2272 ┆ 38      ┆ 3.0       ┆ 1505.0   ┆ … ┆ 0.02    ┆ 74.43   ┆ 6.739812 ┆ 0.03    │
│ A2368 ┆ 16      ┆ 5.0       ┆ 1364.0   ┆ … ┆ 0.03    ┆ 39.26   ┆ 6.948548 ┆ 0.03    │
│ A1671 ┆ 15      ┆ 7.0       ┆ 1524.0   ┆ … ┆ 0.06    ┆ 30.32   ┆ 7.179535 ┆ 0.02    │
│ A2387 ┆ 19      ┆ 4.0       ┆ 1241.0   ┆ … ┆ 0.03    ┆ 56.38   ┆ 5.502594 ┆ 0.03    │
└───────┴─────────┴───────────┴──────────┴───┴─────────┴─────────┴──────────┴─────────┘
PCA weights: [-0.01287908  0.29455573 -0.04803725 -0.00586516 -0.45272384  0.10559961
 -0.04624166  0.43694897  0.50411978 -0.28304628 -0.04045902 -0.40702253] shape: (12, 2)
┌──────────────┬────────────┐
│ Feature      ┆ PCA_Weight │
│ ---          ┆ ---        │
│ str          ┆ f64        │
╞══════════════╪════════════╡
│ Recency      ┆ -0.012879  │
│ Frequency    ┆ 0.294556   │
│ Monetary     ┆ -0.048037  │
│ Total_Cost   ┆ -0.005865  │
│ Total_Clicks ┆ -0.452724  │
│ …            ┆ …          │
│ Avg_CPC      ┆ 0.436949   │
│ Avg_CVR      ┆ 0.50412    │
│ Avg_CPA      ┆ -0.283046  │
│ Avg_ROAS     ┆ -0.040459  │
│ Avg_CTR      ┆ -0.407023  │
└──────────────┴────────────┘
================================================================================
Step2: Calculate Derived Indicators
================================================================================
Complete Calculate ROI, Profit and Efficiency Score
================================================================================
Step3: Establish A Multi-Dimensional Scoring System
================================================================================
Complete 9 dimension scoring
================================================================================
Step 4: Segmentation Based on BCG Dimensions
================================================================================
Calculate the dynamic clustering threshold:
Overall_Score quantiles:
  Performance 60%: 13.00
  Performance 40%: 11.40
  Potential 60%: 9.00
  Potential 40%: 8.00

Distribution of Each Group:
shape: (5, 2)
┌────────────────┬───────┐
│ BCG_Segment    ┆ count │
│ ---            ┆ ---   │
│ str            ┆ u32   │
╞════════════════╪═══════╡
│ Stars          ┆ 710   │
│ Middle Ground  ┆ 700   │
│ Dogs           ┆ 593   │
│ Question Marks ┆ 205   │
│ Cash Cows      ┆ 137   │
└────────────────┴───────┘
================================================================================
Step5: Visualization
================================================================================
#1: BCG Matrix

#2: Total Revenue and Cost Analysis for Each Quadrant

#3: Radar Chart of Performance Indicators in Each Dimension

BCG Matrix Analysis Conclusion:

What We Did: We began by using PCA weights to examine the overall structure of all advertising and performance metrics, allowing us to identify which variables truly differentiate performance across Ad_IDs.

Based on the importance revealed through PCA, we selected business-meaningful KPIs, including RFM indicators and core efficiency metrics, and constructed a BCG scoring system.

Within this scoring framework, we applied quintile-based thresholds (20%, 40%, 60%, 80%) to standardize each KPI into a 1-5 scoring scale.
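As an illustration of this quintile standardization, here is a minimal sketch with toy ROAS values (the actual pipeline applies the same idea to every KPI in Polars; the helper name is ours):

```python
import numpy as np

def quintile_score(values: np.ndarray) -> np.ndarray:
    """Map raw KPI values to 1-5 scores via the 20/40/60/80 percentile cutoffs."""
    cutoffs = np.quantile(values, [0.2, 0.4, 0.6, 0.8])
    # Values at or below the 20th-percentile cutoff score 1; above the 80th, 5
    return np.searchsorted(cutoffs, values, side="right") + 1

roas = np.array([1.2, 3.4, 5.0, 6.8, 7.5, 8.1, 9.9, 2.2, 4.4, 6.1])
scores = quintile_score(roas)
print(scores.tolist())  # every value lands in a 1-5 bucket
```

For "lower is better" metrics such as CPC or CPA, the same helper would be applied to the negated values so that cheaper campaigns still receive higher scores.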

We then calculated each Ad_ID's Overall Score using both the RFM Scores and Performance Scores. Finally, we used the 40% and 60% quantile thresholds of the Overall Score and Potential Score as dynamic cut-off points to provide an objective segmentation basis for the BCG Matrix.
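A minimal sketch of that dynamic cut-off logic with toy scores (for brevity this omits the additional ROI > 0 condition the Stars rule uses in the full pipeline):

```python
import numpy as np

overall = np.array([15, 12, 9, 14, 7, 11, 13, 8])
potential = np.array([10, 6, 9, 9, 4, 8, 5, 3])

# Dynamic thresholds recomputed from the data itself
perf_q60, perf_q40 = np.quantile(overall, 0.6), np.quantile(overall, 0.4)
pot_q60, pot_q40 = np.quantile(potential, 0.6), np.quantile(potential, 0.4)

def quadrant(o, p):
    if o >= perf_q60 and p >= pot_q60:
        return "Stars"
    if o >= perf_q60 and p < pot_q40:
        return "Cash Cows"
    if o < perf_q40 and p >= pot_q60:
        return "Question Marks"
    if o < perf_q40 and p < pot_q40:
        return "Dogs"
    return "Middle Ground"

labels = [quadrant(o, p) for o, p in zip(overall, potential)]
print(labels)
```

Because the thresholds come from quantiles of the observed scores, the quadrant boundaries adapt automatically when the score distribution shifts.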

Deliverables & Insights: Based on the BCG Matrix segmentation and the supporting visualizations, we derived the following insights while considering both financial outcomes and efficiency metrics:
- Stars: Clear top-priority investment targets with the highest returns and strongest efficiency.
- Cash Cows: Provide stable returns but limited growth potential. Maintaining the current investment level is sufficient.
- Question Marks: Currently generate lower revenue but show high upside potential. Require strategic investment to unlock their performance.
- Dogs: Despite appearing to generate high revenue, these ads show poor efficiency. They should be prioritized for optimization or gradually withdrawn.

RANDOM FOREST MODEL

# Scikit-learn imports used by the Random Forest workflow below
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Read cleaned CSV into a separate Pandas dataframe for RF ONLY
df_rf = pd.read_csv(file_path)  # do NOT overwrite df
df_rf = df_rf.drop(columns="Ad_ID", errors="ignore")

# Feature Engineering
# Rolling averages for rate/cost metrics
roll_cols = ["CTR", "CPC", "CVR", "CPA"]
if "Ad_Date" in df_rf.columns:
    df_rf["Ad_Date"] = pd.to_datetime(df_rf["Ad_Date"], errors="coerce")
    df_rf = df_rf.sort_values("Ad_Date")
    for c in roll_cols:
        if c in df_rf.columns:
            df_rf[f"{c}_roll7"] = df_rf[c].rolling(window=7, min_periods=1).mean()
else:
    for c in roll_cols:
        if c in df_rf.columns:
            df_rf[f"{c}_roll7"] = df_rf[c]

# Drop date columns from RF feature set ONLY
df_rf = df_rf.drop(columns=["Ad_Date", "Ad_Date_str"], errors="ignore")

# Build feature matrix X and target y
drop_cols = ["Sale_Amount", "ROAS"]
X = df_rf.drop(columns=drop_cols, errors="ignore")
y = df_rf["ROAS"].astype(float)

# Encode categoricals
for col in X.select_dtypes(include=["object", "category"]).columns:
    le = LabelEncoder()
    X[col] = le.fit_transform(X[col].astype(str))

# Fill remaining NaNs with median
X = X.fillna(X.median(numeric_only=True))

# Train/Validation Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
  • Feature Engineering and Data Preparation: Before model fitting, several feature-engineering steps were added so that the model can capture more of the potential relationships between our variables:
  – Implemented 7-day rolling averages for key performance indicators (CTR, CPC, CVR, CPA) to capture trend patterns and reduce day-to-day volatility. These rolling features help the model understand momentum in campaign performance. Each 7-day rolling average is the mean of the current day's value and the preceding six days' values; each subsequent day, the oldest data point drops off and the newest one is added.
  – Removed the Ad_ID identifier column, as it provides no predictive value.
  – Dropped temporal columns (Ad_Date, Ad_Date_str) from the feature matrix to prevent the model from learning from specific dates.
  – Applied Label Encoding to categorical variables (primarily the Device column).
  – Filled remaining missing values using median imputation to maintain distributional properties.
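To make the rolling-average behavior concrete, here is a small sketch with a toy CTR series; with `min_periods=1`, the first rows average only the observations seen so far instead of producing NaNs:

```python
import pandas as pd

# Toy daily CTR values
ctr = pd.Series([0.02, 0.04, 0.03, 0.05, 0.04, 0.06, 0.05, 0.07])

# 7-day rolling mean: each value averages the current day and up to six prior days
ctr_roll7 = ctr.rolling(window=7, min_periods=1).mean()

print(ctr_roll7.round(4).tolist())
```

The first entry equals the raw value (only one observation available), and from day 7 onward each entry is a full 7-day average.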

  • Model Fitting:
  – The target variable was ROAS (Return on Ad Spend), a critical metric measuring revenue generated per dollar spent on advertising.
  – The feature matrix excluded both ROAS (the target) and Sale_Amount (directly related to the target's calculation) to prevent information leakage, leaving 22 potential predictor variables, including the engineered features.
  – Applied a 70-30 train-test split with a fixed random seed (42) for reproducibility, yielding 1,641 training samples and 704 test samples.
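A quick sanity check of the quoted sample counts, using dummy data of the same size (2,345 rows) rather than the real feature matrix:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# sklearn sizes the test set as ceil(n * test_size): ceil(2345 * 0.3) = 704,
# leaving 1,641 rows for training.
X_dummy = np.arange(2345).reshape(-1, 1)
y_dummy = np.zeros(2345)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_dummy, y_dummy, test_size=0.3, random_state=42
)
print(len(X_tr), len(X_te))  # 1641 704
```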

# Baseline RandomForest
rf_base = RandomForestRegressor(
    n_estimators=300, max_depth=None, random_state=42, n_jobs=-1
)
rf_base.fit(X_train, y_train)

# Feature Selection
selector = SelectFromModel(rf_base, threshold="median", prefit=True)
X_train_sel = selector.transform(X_train)
X_test_sel  = selector.transform(X_test)
selected_features = X.columns[selector.get_support()]

# Final RF Model
rf_final = RandomForestRegressor(
    n_estimators=300, max_depth=None, random_state=42, n_jobs=-1
)
rf_final.fit(X_train_sel, y_train)

# Evaluate
y_pred = rf_final.predict(X_test_sel)
mae = mean_absolute_error(y_test, y_pred)
r2  = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print(f"Random Forest MAE: {mae:.4f}")
print(f"Random Forest RMSE: {rmse:.4f}")
print(f"Random Forest R² : {r2:.4f}")
Random Forest MAE: 1.1985
Random Forest RMSE: 1.4205
Random Forest R² : 0.1848
  • Feature Selection:
  – Baseline Model: Trained an initial Random Forest with 300 trees and unlimited depth on all available features.
  – Feature Selection: Used SelectFromModel with a median importance threshold, retaining only features whose importance is at or above the median importance from the baseline model.
  – Final Model: Retrained a Random Forest using only the selected features.
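What the median threshold does can be shown with a few hypothetical importance values (SelectFromModel keeps features whose importance is at or above the threshold):

```python
import numpy as np

# Hypothetical importances from a fitted baseline forest
importances = np.array([0.30, 0.05, 0.12, 0.08, 0.25, 0.02, 0.10, 0.08])

threshold = np.median(importances)   # the "median" threshold
keep_mask = importances >= threshold # features that survive selection

print(threshold, int(keep_mask.sum()))  # roughly half the features survive
```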
# Cross-Validation
X_selected_full = selector.transform(X)
cv_scores = cross_val_score(rf_final, X_selected_full, y, cv=5, scoring="r2")
cv_mean = float(np.mean(cv_scores))
cv_std  = float(np.std(cv_scores))
print(f"5-fold CV R²: {cv_mean:.4f} ± {cv_std:.4f}")

# Feature Importances
importances = pd.Series(rf_final.feature_importances_, index=selected_features).sort_values(ascending=False)
top5 = importances.head(5)
print("\nTop 5 Feature Importances:")
print(top5)

plt.figure()
top5.plot(kind="bar")
plt.title("Top 5 Feature Importances (Final RF)")
plt.xlabel("Feature")
plt.ylabel("Importance")
plt.tight_layout()
plot_path = Path("top5_feature_importances.png")
plt.savefig(plot_path, dpi=300)
plt.show()
5-fold CV R²: 0.1962 ± 0.0283

Top 5 Feature Importances:
Cost           0.337651
Impressions    0.092001
CPA_roll7      0.083847
CPC_roll7      0.079827
CPA            0.076720
dtype: float64

  • Random Forest Model Results: MAE = 1.1985; R² = 0.1848; 5-fold cross-validation R² = 0.1962 ± 0.0283

ROAS ANALYSIS

  • Overall ROAS on Cleaned Data
# Aggregate totals
totals = df.select([
    pl.col("Cost").sum().alias("total_cost"),
    pl.col("Sale_Amount").sum().alias("total_revenue")
])

total_cost = totals["total_cost"][0]
total_revenue = totals["total_revenue"][0]

romi = (total_revenue - total_cost) / total_cost
roas = total_revenue / total_cost

print(f"Total Cost: ${total_cost:,.2f}")
print(f"Total Revenue: ${total_revenue:,.2f}")
print(f"ROAS: {roas:.2f}x")
print(f"ROMI: {romi*100:.1f}%")
Total Cost: $504,331.33
Total Revenue: $3,512,967.00
ROAS: 6.97x
ROMI: 596.6%
  • ROAS Segmented by Device
# ROAS Segmented by Device
roas_by_device = (
    df.group_by("Device")
      .agg([
          pl.col("Cost").sum().alias("total_cost"),
          pl.col("Sale_Amount").sum().alias("total_revenue")
      ])
      .with_columns([
          (pl.col("total_revenue") / pl.col("total_cost")).alias("ROAS"),
          ((pl.col("total_revenue") - pl.col("total_cost")) / pl.col("total_cost")).alias("ROMI")
      ])
      .sort("ROAS", descending=True)   # sort by ROAS since this is a ROAS chart
)

print(roas_by_device)

# Convert to something matplotlib can work with easily
devices = roas_by_device["Device"].to_list()
roas_vals = roas_by_device["ROAS"].to_list()

# Visualize ROAS Segmented by Device
plt.figure(figsize=(8, 6))
plt.bar(
    devices,
    roas_vals,
    color=["#6baed6", "#74c476", "#fd8d3c"],
    edgecolor="black"
)
plt.xlabel("Device", fontsize=12)
plt.ylabel("ROAS (x)", fontsize=12)
plt.title("Return on Ad Spend (ROAS) by Device", fontsize=14, fontweight="bold")
plt.grid(axis="y", alpha=0.3)

# Add value labels (e.g., 6.97x)
for i, val in enumerate(roas_vals):
    plt.text(i, val + 0.05, f"{val:.2f}x", ha="center", fontsize=11, fontweight="bold")

plt.tight_layout()
plt.show()
shape: (3, 5)
┌─────────┬────────────┬───────────────┬──────────┬──────────┐
│ Device  ┆ total_cost ┆ total_revenue ┆ ROAS     ┆ ROMI     │
│ ---     ┆ ---        ┆ ---           ┆ ---      ┆ ---      │
│ str     ┆ f64        ┆ f64           ┆ f64      ┆ f64      │
╞═════════╪════════════╪═══════════════╪══════════╪══════════╡
│ Desktop ┆ 171333.18  ┆ 1.197615e6    ┆ 6.989977 ┆ 5.989977 │
│ Tablet  ┆ 161796.47  ┆ 1.129394e6    ┆ 6.980338 ┆ 5.980338 │
│ Mobile  ┆ 171201.68  ┆ 1.185958e6    ┆ 6.927257 ┆ 5.927257 │
└─────────┴────────────┴───────────────┴──────────┴──────────┘

  • ROAS Segmented by Day of the Week
# ROAS segmented by day of the week
roas_by_day = (
    df.group_by("day_of_week")
      .agg([
          pl.col("Cost").sum().alias("Total_Cost"),
          pl.col("Sale_Amount").sum().alias("Total_Revenue")
      ])
      .with_columns([
          (pl.col("Total_Revenue") / pl.col("Total_Cost")).alias("ROAS"),
          ((pl.col("Total_Revenue") - pl.col("Total_Cost")) / pl.col("Total_Cost")).alias("ROMI")
      ])
      .sort("ROAS", descending=True)
)

print(roas_by_day)

# Extract values for plotting
days = roas_by_day["day_of_week"].to_list()
roas_vals = roas_by_day["ROAS"].to_list()

# Visualize ROAS segmented by day of the week
plt.figure(figsize=(8, 6))
plt.bar(days, roas_vals, color="skyblue", edgecolor="black")
plt.xlabel("Day of Week")
plt.ylabel("ROAS (x)")
plt.title("Return on Ad Spend (ROAS) by Day of Week")
plt.grid(axis="y", alpha=0.3)

# Add value labels
for i, val in enumerate(roas_vals):
    plt.text(i, val + 0.05, f"{val:.2f}x", ha="center", fontsize=10, fontweight="bold")

plt.tight_layout()
plt.show()
shape: (7, 5)
┌─────────────┬────────────┬───────────────┬──────────┬──────────┐
│ day_of_week ┆ Total_Cost ┆ Total_Revenue ┆ ROAS     ┆ ROMI     │
│ ---         ┆ ---        ┆ ---           ┆ ---      ┆ ---      │
│ str         ┆ f64        ┆ f64           ┆ f64      ┆ f64      │
╞═════════════╪════════════╪═══════════════╪══════════╪══════════╡
│ Monday      ┆ 69276.39   ┆ 488381.0      ┆ 7.049747 ┆ 6.049747 │
│ Thursday    ┆ 83947.07   ┆ 589864.0      ┆ 7.026618 ┆ 6.026618 │
│ Saturday    ┆ 75915.16   ┆ 533312.0      ┆ 7.025105 ┆ 6.025105 │
│ Friday      ┆ 78078.18   ┆ 543547.0      ┆ 6.961574 ┆ 5.961574 │
│ Sunday      ┆ 62135.5    ┆ 429475.0      ┆ 6.91191  ┆ 5.91191  │
│ Wednesday   ┆ 69571.48   ┆ 478628.0      ┆ 6.879658 ┆ 5.879658 │
│ Tuesday     ┆ 65407.55   ┆ 449760.0      ┆ 6.87627  ┆ 5.87627  │
└─────────────┴────────────┴───────────────┴──────────┴──────────┘

  • This heatmap visualizes which day and device combinations maximize ROAS.
# Ensure day_of_week exists; derive from Ad_Date if needed
if "day_of_week" not in df.columns:
    df = df.with_columns(
        pl.col("Ad_Date")
          .str.strptime(pl.Date, format="%m/%d/%Y", strict=False)
          .dt.strftime("%A")
          .alias("day_of_week")
    )

# Compute ROAS by Device x Day of Week
roas_grid = (
    df.group_by(["day_of_week", "Device"])
      .agg([
          pl.col("Cost").sum().alias("Total_Cost"),
          pl.col("Sale_Amount").sum().alias("Total_Revenue")
      ])
      .with_columns(
          (pl.col("Total_Revenue") / pl.col("Total_Cost")).alias("ROAS")
      )
)

# Define order for days and devices
day_order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
device_order = sorted(df.select("Device").unique().to_series().to_list())

# Keep only valid days 
roas_grid = roas_grid.filter(pl.col("day_of_week").is_in(day_order))

# Pivot: rows = day_of_week, columns = Device, values = ROAS
roas_pivot = (
    roas_grid
    .pivot(
        index="day_of_week",
        columns="Device",
        values="ROAS"
    )
)

# Re-order rows by day_order
roas_pivot = (
    pl.DataFrame({"day_of_week": day_order})
    .join(roas_pivot, on="day_of_week", how="inner")
)

print(roas_pivot)

# Build matrix for plotting
day_labels = roas_pivot["day_of_week"].to_list()
device_cols = [d for d in device_order if d in roas_pivot.columns]

# ROAS is already a ratio
roas_matrix = roas_pivot.select(device_cols).to_numpy()

# Plot heatmap
fig, ax = plt.subplots(figsize=(8, 5))
im = ax.imshow(roas_matrix, cmap='YlOrRd', aspect='auto')

# Axis labels
ax.set_xticks(np.arange(len(device_cols)))
ax.set_yticks(np.arange(len(day_labels)))
ax.set_xticklabels(device_cols)
ax.set_yticklabels(day_labels)

# Annotate each cell with ROAS
for i in range(len(day_labels)):
    for j in range(len(device_cols)):
        val = roas_matrix[i, j]
        if not np.isnan(val):
            ax.text(j, i, f"{val:.2f}x", ha="center", va="center", fontsize=9)

ax.set_title("ROAS by Device and Day of Week")
ax.set_xlabel("Device")
ax.set_ylabel("Day of Week")

cbar = fig.colorbar(im, ax=ax)
cbar.set_label("ROAS (x)")

plt.tight_layout()
plt.show()
shape: (7, 4)
┌─────────────┬──────────┬──────────┬──────────┐
│ day_of_week ┆ Mobile   ┆ Desktop  ┆ Tablet   │
│ ---         ┆ ---      ┆ ---      ┆ ---      │
│ str         ┆ f64      ┆ f64      ┆ f64      │
╞═════════════╪══════════╪══════════╪══════════╡
│ Wednesday   ┆ 6.917306 ┆ 6.999815 ┆ 6.728031 │
│ Sunday      ┆ 6.822471 ┆ 6.929751 ┆ 6.993048 │
│ Friday      ┆ 6.83377  ┆ 7.066956 ┆ 6.976465 │
│ Tuesday     ┆ 6.889602 ┆ 6.656144 ┆ 7.073429 │
│ Saturday    ┆ 7.036251 ┆ 7.131999 ┆ 6.913717 │
│ Monday      ┆ 7.105285 ┆ 6.99724  ┆ 7.05452  │
│ Thursday    ┆ 6.885326 ┆ 7.069116 ┆ 7.13477  │
└─────────────┴──────────┴──────────┴──────────┘

ROAS Conclusion:
- Since this data is hypothetical, the ROMI would not reflect a real-life Google Ads scenario. The ROMI, [(revenue - cost) / cost], for this dataset came to 596.6%, which is relatively high; according to enhancer.com, the average ROMI on Google Ads across all industries is 200%.
- To make an actionable insight, attributes such as “Device” and “day_of_week” were used to segment which days and devices show stronger ROMI. After seeing minimal differences in ROMI, a “Device” x “day_of_week” heatmap was created to identify which day-device combinations deliver the highest ROAS.
- Based on the heatmap: Saturdays should focus on Desktop ads, Mondays on Mobile ads, and Thursdays on Tablet ads.
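Because ROMI = (revenue − cost) / cost and ROAS = revenue / cost, ROMI is always exactly ROAS minus one; a quick check with the totals reported earlier:

```python
total_cost = 504_331.33
total_revenue = 3_512_967.00

roas = total_revenue / total_cost                 # revenue per dollar spent
romi = (total_revenue - total_cost) / total_cost  # return over cost

assert abs(romi - (roas - 1)) < 1e-12             # the two differ by exactly 1
print(f"ROAS: {roas:.2f}x, ROMI: {romi * 100:.1f}%")
```

This is why the device and day-of-week ROMI comparisons rank identically to the ROAS comparisons: the two metrics are the same ordering shifted by one.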

Results and Findings

Note: The dataset is entirely simulated (from Kaggle), so the numerical results should not be interpreted as real business outcomes. However, the analytical methods and the logic of the findings remain valid and representative of real-world marketing analytics.

  1. Data Preparation & Cleaning
  • Kaggle’s simulated advertising dataset includes 2,600 rows and 13 columns, containing synthetic impressions, clicks, conversions, and sales.
  • After cleaning, 2,345 complete rows remained, with engineered features including CPC, CPA, CVR, CTR, ROAS, device/day-of-week variables, and 7-day rolling means.
  2. Overall Campaign Performance
  • From the processed dataset: Total Cost: $504,331.33; Total Sales: $3,512,967; Avg ROAS: 7.04x; Avg CPC: $1.65; Avg CPA: $38.16; Avg CVR: 5.04%
  3. Exploratory Data Analysis
  • ROAS Variability: Mean ROAS ≈ 7, but with a wide, approximately normal distribution, suggesting mixed performance across campaigns.

  • Correlations: Strong positive relationship between ROAS and Sale_Amount.

  4. Segmentation Analysis (K-Means, k=4)
  • Cluster Characteristics Mirror Typical Ad Behaviors:
    Segment  Clicks  Impressions  Cost  Conversions
    1        124     4,969        $193  5.7
    2        104     4,322        $226  7.4
    3        164     5,135        $232  6.3
    4        162     3,638        $211  6.7
  • Key Findings:

Segment 2 — Strong conversions but high cost → high-potential segment. – Segment 2 shows the highest conversions (7.4) among all groups, indicating strong audience intent and solid bottom-funnel performance. However, this comes with a higher cost level ($226), suggesting that while this segment is expensive to acquire, it consistently generates valuable actions. This makes it a high-potential group worth closer monitoring and selective investment.

Segment 4 — Efficient with low impressions → ideal for scaling tests. – Segment 4 achieves strong efficiency metrics—relatively high conversions (6.7) with the lowest impressions (3,638) in the dataset. This implies that the segment is performing well despite limited reach. Because efficiency is already proven, the main unknown is scale, making Segment 4 an ideal candidate for budget-scaling tests to evaluate its incremental ROI.

Segments 1 & 3 — Mid-range performance → maintenance audiences. – Segments 1 and 3 sit in the middle across most metrics—moderate clicks, average impressions, and stable but unremarkable conversions. Their performance is neither high-risk nor high-return, making them suitable as maintenance audiences where consistent presence is beneficial, but aggressive scaling is not necessary.

  • Insight: Different creative sets, audiences, or bidding strategies are likely driving the variance; budget should prioritize Segments 2 and 4.
  5. ROAS-Based Targeting Model
  • Using quantile-based ROAS segmentation:

    ROAS Segment Customers Quantile Avg Sales
    High 798 Top 33% $1,787.70
    Medium 773 Middle 33% $1,509.05
    Low 774 Bottom 33% $1,188.48
  • Classification accuracy (from the confusion matrix): 37%

  • Baseline: 33%

  • The signal is weak, meaning the chosen features (Device, Clicks, Impressions, Cost, Conversions) do not strongly separate ROAS levels.

  • Using quantiles creates artificial classes that do not reflect natural behavior patterns.

  • Even though the classification accuracy is low, the estimated total Sale_Amount is extremely close to the actual total: an error of only –0.06%.

  • Possible Reasons:

  1. Segment average Sale_Amount values are very similar.
  2. Misclassifying thousands of users across 3 segments creates a “cancelling-out” effect.
  3. The segmentation might not represent truly different customer behaviors.
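The cancelling-out effect can be demonstrated with a small simulation (segment means from the table above; the sample size and the classifier's behavior are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
seg_means = {"High": 1787.70, "Medium": 1509.05, "Low": 1188.48}
segments = np.array(list(seg_means))

# True segments are roughly balanced, as in the quantile-based split
true_seg = rng.choice(segments, size=2345)

# A weak classifier: keeps the true label ~37% of the time, guesses otherwise
pred_seg = np.where(rng.random(2345) < 0.37,
                    true_seg,
                    rng.choice(segments, size=2345))

# Estimate each user's Sale_Amount by the mean of their (predicted) segment
actual_total = sum(seg_means[s] for s in true_seg)
estimated_total = sum(seg_means[s] for s in pred_seg)
error_pct = (estimated_total - actual_total) / actual_total * 100

print(f"Aggregate error: {error_pct:+.2f}%")  # small despite many misclassifications
```

Because each misclassification swaps one segment mean for another of similar magnitude, individual errors largely offset in the aggregate total.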
  6. BCG Matrix Analysis
  • Based on Recency, Frequency, Monetary value, and average efficiency metrics (Avg_CVR, Avg_CPA, Avg_ROAS, Avg_CTR), ads show substantial variability in value. Some ads deliver strong returns, while others consistently underperform and require budget reduction.

  • BCG segmentation:
- Stars (710): High performance and high potential; they contribute the most revenue with consistently strong ROAS.
- Cash Cows (137): High ROI with low cost; these ads are the most profitable and stable.
- Question Marks (205): High potential but inconsistent efficiency; require further A/B testing to validate.
- Dogs (593): Low ROAS, low conversions, and poor return on cost; primary candidates for budget cuts.
- Middle Ground (700): Moderate performance; maintain baseline spend without scaling.

  • Revenue vs. Cost:
- Stars generate the highest revenue (~$226k) with reasonable costs (~$30k), indicating strong overall value.
- Cash Cows deliver stable revenue (~$37k) at extremely low cost (~$4k), making them the most ROI-efficient segment.
- Dogs have a poor revenue-to-cost balance (~$141k revenue vs. ~$22k cost), confirming low efficiency.
- Question Marks show low cost (~$8k) but meaningful revenue (~$50k), suggesting potential upside if optimized.

  • Radar chart:
- Stars excel across Monetary, ROI, ROAS, and overall performance indicators.
- Cash Cows achieve the strongest ROI, CPA, and CVR efficiency profiles.
- Question Marks exhibit potential in some dimensions but lack consistency.
- Dogs underperform across all metrics and represent the bottom tier.

  • Actionable insights:
- Increase investment: Stars, Cash Cows
- Test and evaluate: Question Marks
- Maintain baseline spend: Middle Ground
- Reduce or eliminate spend: Dogs

  7. Random Forest Model
  • Feature Engineering:
- Created rolling averages: 7-day rolling means for CTR, CPC, CVR, CPA
- Temporal features: day of week, weekend/weekday indicators
- Calculated metrics: CPC, CVR, CPA, ROAS, CTR (already in the dataset)

  • Model Development:
- Configuration: 300 trees, unlimited depth
- Feature Selection: applied median-threshold selection to reduce dimensionality
- Split: 70% training, 30% validation
- Validation: 5-fold cross-validation on selected features

  • Model Performance:

  • R² = 0.185: the model explains ~19% of ROAS variance.

  • MAE = 1.199: on average, the Random Forest's ROAS predictions deviate from the actual values by about 1.2 units. Since the average ROAS is around 7, this corresponds to a relative error of about 17% (1.2 / 7 ≈ 0.17), which is often considered acceptable.
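The relative-error arithmetic behind that statement, spelled out:

```python
mae = 1.1985        # test-set MAE from the Random Forest
mean_roas = 7.0     # approximate average ROAS in the dataset

relative_error = mae / mean_roas
print(f"Relative error: {relative_error:.0%}")  # about 17%
```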

  • Findings: Cost and impressions measure scale (how much reach and spending a campaign achieves), while CPC, CPA, CTR, and CVR are micro-efficiency metrics that capture how efficiently each interaction converts into value. Cost and impressions dominate the prediction of return, showing that scale matters more than micro-efficiency.

  8. ROAS Analysis
  • Overall profitability is extremely strong (average ROAS 7.04x). This means every $1 spent on advertising generated $7.04 in revenue.

  • All three devices deliver ROMI values near 600%:

    Device ROAS
    Desktop 6.99x
    Tablet 6.98x
    Mobile 6.93x
  • Desktop has a slight edge, but Tablet and Mobile remain nearly equally profitable. This suggests balanced device performance with no immediate need to cut any channel.

  • Day-of-week:

  • High-performing days: Monday (7.04x), Thursday (7.03x), Saturday (7.03x)

  • Lower (but still very profitable) days: Friday (6.96x), Sunday (6.91x)

  • Combined device × day-of-week heatmap:

  • Highest ROAS cells: Desktop–Saturday (7.13x), Tablet–Thursday (7.13x), Mobile–Monday (7.11x)

  • Lowest ROAS cells: Desktop–Tuesday (6.66x), Tablet–Wednesday (6.73x), Mobile–Sunday (6.82x)

  • Interpretation - While all combinations remain profitable, these patterns can guide tactical adjustments. Weekends and early-week days tend to be strongest for scaling; midweek combinations may benefit from a creative refresh or budget moderation.
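A device × day-of-week grid like the one behind this heatmap can be built with a pandas pivot. The numbers below are a toy subset for illustration; the real analysis aggregates the full dataset:

```python
import pandas as pd

# Toy device × day aggregates (illustrative numbers, not the full dataset)
roas_grid = pd.DataFrame({
    "day_of_week": ["Monday", "Monday", "Tuesday", "Tuesday"],
    "Device": ["Desktop", "Mobile", "Desktop", "Mobile"],
    "ROAS": [7.05, 7.11, 6.66, 6.89],
})

# Pivot into a Device × day matrix; sns.heatmap(heat, annot=True) renders it
heat = roas_grid.pivot(index="Device", columns="day_of_week", values="ROAS")
print(heat)
```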

Research Implications

  • Provides a transferable analytical framework that can serve as a blueprint for evaluating digital advertising performance in real-world Google Ads campaigns.

  • Demonstrates the value and limitations of synthetic datasets, highlighting where simulated results diverge from actual consumer behavior and informing the creation of more realistic marketing data for research and education.

  • Establishes a clear, replicable analysis pipeline—from data cleaning through modeling and strategy development—that researchers and practitioners can apply to future marketing datasets.

  • Strengthens analytical skill-building, offering a structured example of how segmentation, ROAS-based evaluation, and model-driven insights can guide campaign optimization.

Managerial Recommendations & Actionable Plans

1. Prioritize the second-tier high-performing day × device combinations (Thu–Desktop, Fri–Desktop, Sat–Mobile, Tue–Tablet, Mon–Tablet). These consistently deliver strong ROAS but are not yet maxed out, making them the best opportunity for growth. Reallocating budget away from low-performing combinations and toward the second tier raises overall ROAS from 6.97x to 7.04x without increasing total spend.

print("\n" + "="*80)
print("ROAS Reallocation Scenario")
print("="*80)

# Reload a clean Polars df for ROAS analysis 
df_roas = pl.read_csv(file_path)

# Ensure day_of_week exists
if "day_of_week" not in df_roas.columns:
    df_roas = df_roas.with_columns(
        pl.col("Ad_Date")
          .str.strptime(pl.Date, format="%Y-%m-%d", strict=False)
          .dt.strftime("%A")
          .alias("day_of_week")
    )

# BASELINE GLOBAL ROAS
# ------------------------------
totals = df_roas.select([
    pl.col("Cost").sum().alias("total_cost"),
    pl.col("Sale_Amount").sum().alias("total_revenue")
]).row(0)

total_cost, total_revenue = totals
baseline_roas = total_revenue / total_cost

print(f"Baseline ROAS: {baseline_roas:.2f}x\n")

# Build Device × Day ROAS Grid
# ------------------------------
roas_grid = (
    df_roas.group_by(["day_of_week", "Device"])
    .agg([
        pl.col("Cost").sum().alias("Total_Cost"),
        pl.col("Sale_Amount").sum().alias("Total_Revenue")
    ])
    .with_columns(
        (pl.col("Total_Revenue") / pl.col("Total_Cost")).alias("ROAS")
    )
)

# Redefine Weak & Strong Cells
# ------------------------------

# Weak = ROAS BELOW the global portfolio ROAS
weak_cells = roas_grid.filter(pl.col("ROAS") < baseline_roas)

# Strong = ROAS ABOVE global ROAS
strong_cells = roas_grid.filter(pl.col("ROAS") > baseline_roas)

# Top-3 best cells (kept as-is)
top_cells = strong_cells.sort("ROAS", descending=True).head(3)

# Second-tier strong performers = strong but not top 3.
# Anti-join on the (day, device) pair; filtering each column independently
# with is_in would wrongly exclude cross combinations such as Thursday–Desktop.
second_tier_cells = strong_cells.join(
    top_cells.select(["day_of_week", "Device"]),
    on=["day_of_week", "Device"],
    how="anti",
)

print("Top Cells:")
print(top_cells)

print("\nSecond-Tier Cells:")
print(second_tier_cells)

print("\nWeak Cells:")
print(weak_cells)


# Reallocation Amount
# ------------------------------

REALLOC_FRAC = 1.00

weak_cost, weak_rev = weak_cells.select([
    pl.col("Total_Cost").sum(),
    pl.col("Total_Revenue").sum()
]).row(0)

shift_cost = weak_cost * REALLOC_FRAC
remaining_weak_cost = weak_cost * (1 - REALLOC_FRAC)
remaining_weak_revenue = weak_rev * (1 - REALLOC_FRAC)

print(f"\nReallocating {REALLOC_FRAC*100:.0f}% of weak spend → second-tier")
print(f"Shift Cost: ${shift_cost:,.2f}")


# Simulate new revenue in second-tier
# ------------------------------
second_cost, second_rev = second_tier_cells.select([
    pl.col("Total_Cost").sum(),
    pl.col("Total_Revenue").sum()
]).row(0)

new_second_cost = second_cost + shift_cost
new_second_rev = second_rev * (new_second_cost / second_cost)


# Recompute Total Portfolio Revenue
# ------------------------------
new_total_revenue = (
    total_revenue
    - second_rev
    - weak_rev
    + new_second_rev
    + remaining_weak_revenue
)

new_roas = new_total_revenue / total_cost

# RESULTS
# ------------------------------
print("\n" + "="*60)
print("RESULTS")
print("="*60)
print(f"Baseline ROAS: {baseline_roas:.2f}x")
print(f"New ROAS:      {new_roas:.2f}x")
print(f"Δ ROAS:        {new_roas - baseline_roas:.2f}x")
print("="*60)

================================================================================
ROAS Reallocation Scenario
================================================================================
Baseline ROAS: 6.97x

Top Cells:
shape: (3, 5)
┌─────────────┬─────────┬────────────┬───────────────┬──────────┐
│ day_of_week ┆ Device  ┆ Total_Cost ┆ Total_Revenue ┆ ROAS     │
│ ---         ┆ ---     ┆ ---        ┆ ---           ┆ ---      │
│ str         ┆ str     ┆ f64        ┆ f64           ┆ f64      │
╞═════════════╪═════════╪════════════╪═══════════════╪══════════╡
│ Thursday    ┆ Tablet  ┆ 25968.46   ┆ 185279.0      ┆ 7.13477  │
│ Saturday    ┆ Desktop ┆ 23899.89   ┆ 170454.0      ┆ 7.131999 │
│ Monday      ┆ Mobile  ┆ 22986.13   ┆ 163323.0      ┆ 7.105285 │
└─────────────┴─────────┴────────────┴───────────────┴──────────┘

Second-Tier Cells:
shape: (5, 5)
┌─────────────┬─────────┬────────────┬───────────────┬──────────┐
│ day_of_week ┆ Device  ┆ Total_Cost ┆ Total_Revenue ┆ ROAS     │
│ ---         ┆ ---     ┆ ---        ┆ ---           ┆ ---      │
│ str         ┆ str     ┆ f64        ┆ f64           ┆ f64      │
╞═════════════╪═════════╪════════════╪═══════════════╪══════════╡
│ Friday      ┆ Desktop ┆ 27620.38   ┆ 195192.0      ┆ 7.066956 │
│ Tuesday     ┆ Tablet  ┆ 21928.26   ┆ 155108.0      ┆ 7.073429 │
│ Wednesday   ┆ Desktop ┆ 22765.03   ┆ 159351.0      ┆ 6.999815 │
│ Friday      ┆ Tablet  ┆ 24794.22   ┆ 172976.0      ┆ 6.976465 │
│ Sunday      ┆ Tablet  ┆ 19616.91   ┆ 137182.0      ┆ 6.993048 │
└─────────────┴─────────┴────────────┴───────────────┴──────────┘

Weak Cells:
shape: (9, 5)
┌─────────────┬─────────┬────────────┬───────────────┬──────────┐
│ day_of_week ┆ Device  ┆ Total_Cost ┆ Total_Revenue ┆ ROAS     │
│ ---         ┆ ---     ┆ ---        ┆ ---           ┆ ---      │
│ str         ┆ str     ┆ f64        ┆ f64           ┆ f64      │
╞═════════════╪═════════╪════════════╪═══════════════╪══════════╡
│ Sunday      ┆ Mobile  ┆ 21907.46   ┆ 149463.0      ┆ 6.822471 │
│ Wednesday   ┆ Mobile  ┆ 23044.52   ┆ 159406.0      ┆ 6.917306 │
│ Tuesday     ┆ Desktop ┆ 21001.65   ┆ 139790.0      ┆ 6.656144 │
│ Wednesday   ┆ Tablet  ┆ 23761.93   ┆ 159871.0      ┆ 6.728031 │
│ Saturday    ┆ Tablet  ┆ 25580.74   ┆ 176858.0      ┆ 6.913717 │
│ Friday      ┆ Mobile  ┆ 25663.58   ┆ 175379.0      ┆ 6.83377  │
│ Sunday      ┆ Desktop ┆ 20611.13   ┆ 142830.0      ┆ 6.929751 │
│ Tuesday     ┆ Mobile  ┆ 22477.64   ┆ 154862.0      ┆ 6.889602 │
│ Thursday    ┆ Mobile  ┆ 28687.82   ┆ 197525.0      ┆ 6.885326 │
└─────────────┴─────────┴────────────┴───────────────┴──────────┘

Reallocating 100% of weak spend → second-tier
Shift Cost: $212,736.47

============================================================
RESULTS
============================================================
Baseline ROAS: 6.97x
New ROAS:      7.04x
Δ ROAS:        0.08x
============================================================
  2. Target Segment 3 - consumers with high reach but low conversions. This segment represents wasted reach and untapped revenue potential. High reach with low conversions can stem from poor audience precision, generic creative, or low-intent placements.
    • Some ways to improve conversions include:
      • Tightening targeting by refining keyword optimization
        • According to Idea Marketing, the company should start with comprehensive (broad) keyword matching and, based on how well the keywords perform in Google Ads, refine them until exact matches for specific keywords are reached.
      • Tailor the creative to this group to help personalize the message.
      • Use strong calls to action (CTAs)
        • According to Velocityppc, powerful CTAs are direct and action-oriented, create urgency, and are personalized.
  3. Based on the BCG analysis, the company should focus on improving the efficiency of the Question Marks, targeting a 15% uplift.
    • This can be implemented by improving the Question Marks via creative experimentation, landing page optimization, or keyword refinement, then monitoring weekly shifts to validate revenue growth and prevent oversaturation. Boosting the Question Marks' efficiency by 15% raises the new ROAS to 7.04x and the new ROMI to 603.9%, a 7.33 percentage-point increase without adjusting overall spend.
print("="*80)
print("Improve Question Marks & Test Reallocation")
print("="*80)

# Aggregate baseline cost & revenue by BCG segment
seg_perf = (
    bcg
    .group_by("BCG_Segment")
    .agg([
        pl.col("Total_Cost").sum().alias("Cost"),
        pl.col("Monetary").sum().alias("Revenue"),
    ])
)

seg_pd = seg_perf.to_pandas().set_index("BCG_Segment")

# Baseline ROAS & ROMI
baseline_cost = seg_pd["Cost"].sum()
baseline_rev  = seg_pd["Revenue"].sum()

baseline_roas = baseline_rev / baseline_cost
baseline_romi = (baseline_rev - baseline_cost) / baseline_cost * 100

print("\n===== BASELINE (All Segments) =====")
print(f"Baseline ROAS: {baseline_roas:.2f}x")
print(f"Baseline ROMI: {baseline_romi:.1f}%\n")

print("--- Cost & ROMI by Segment ---")
seg_pd["ROMI_%"] = (seg_pd["Revenue"] - seg_pd["Cost"]) / seg_pd["Cost"] * 100
print(seg_pd[["Cost", "Revenue", "ROMI_%"]])

# Assume we can improve Question Marks efficiency 
qm_uplift = 0.15   # 15% uplift scenario

seg_opt = seg_pd.copy()
if "Question Marks" in seg_opt.index:
    seg_opt.loc["Question Marks", "Revenue"] *= (1 + qm_uplift)
else:
    print("\n[Warning] No 'Question Marks' segment found in BCG_Segment.")
    
# Recompute overall ROAS/ROMI after uplift 
uplift_cost = seg_opt["Cost"].sum()       
uplift_rev  = seg_opt["Revenue"].sum()

uplift_roas = uplift_rev / uplift_cost
uplift_romi = (uplift_rev - uplift_cost) / uplift_cost * 100

print("\n===== AFTER QUESTION MARKS UPLIFT (No Reallocation) =====")
print(f"QM revenue uplift assumed: {qm_uplift*100:.1f}%")
print(f"New ROAS: {uplift_roas:.2f}x")
print(f"New ROMI: {uplift_romi:.1f}%")
print(f"Δ ROAS vs baseline: {uplift_roas - baseline_roas:.2f}x")
print(f"Δ ROMI vs baseline: {uplift_romi - baseline_romi:.2f} percentage points")

# After uplift, reallocate some Dogs budget → Question Marks
shift_pct = 0.20   # 20% of Dogs cost

if "Dogs" in seg_opt.index and "Question Marks" in seg_opt.index:
    dogs_cost = seg_opt.loc["Dogs", "Cost"]
    dogs_rev  = seg_opt.loc["Dogs", "Revenue"]
    qm_cost   = seg_opt.loc["Question Marks", "Cost"]
    qm_rev    = seg_opt.loc["Question Marks", "Revenue"]

    # ROAS (not ROMI) multipliers AFTER uplift 
    roas_dogs = dogs_rev / dogs_cost if dogs_cost > 0 else 0
    roas_qm   = qm_rev   / qm_cost   if qm_cost > 0 else 0

    shift_amount = dogs_cost * shift_pct

    # Update costs
    seg_opt.loc["Dogs", "Cost"]           -= shift_amount
    seg_opt.loc["Question Marks", "Cost"] += shift_amount

    # Assume revenue scales linearly with cost at current ROAS
    seg_opt.loc["Dogs", "Revenue"]           = seg_opt.loc["Dogs", "Cost"] * roas_dogs
    seg_opt.loc["Question Marks", "Revenue"] = seg_opt.loc["Question Marks", "Cost"] * roas_qm

    # Recompute overall ROAS/ROMI after uplift + reallocation
    final_cost = seg_opt["Cost"].sum()
    final_rev  = seg_opt["Revenue"].sum()

    final_roas = final_rev / final_cost
    final_romi = (final_rev - final_cost) / final_cost * 100

    print("\n===== AFTER UPLIFT + REALLOCATION (Dogs → Question Marks) =====")
    print(f"Shifted {shift_pct*100:.1f}% of Dogs cost → Question Marks")
    print(f"Shift amount: ${shift_amount:,.2f}")
    print(f"Final ROAS: {final_roas:.2f}x")
    print(f"Final ROMI: {final_romi:.1f}%")
    print(f"Δ ROAS vs baseline: {final_roas - baseline_roas:.2f}x")
    print(f"Δ ROMI vs baseline: {final_romi - baseline_romi:.2f} percentage points")
else:
    print("\n[Info] Could not run Dogs → Question Marks reallocation "
          "(segment label missing).")
================================================================================
Improve Question Marks & Test Reallocation
================================================================================

===== BASELINE (All Segments) =====
Baseline ROAS: 6.97x
Baseline ROMI: 596.6%

--- Cost & ROMI by Segment ---
                     Cost    Revenue      ROMI_%
BCG_Segment                                     
Middle Ground   150900.16  1053486.0  598.134449
Cash Cows        28121.14   248736.0  784.516062
Question Marks   46049.18   246296.0  434.854258
Dogs            129430.89   805076.0  522.012257
Stars           149829.96  1159373.0  673.792505

===== AFTER QUESTION MARKS UPLIFT (No Reallocation) =====
QM revenue uplift assumed: 15.0%
New ROAS: 7.04x
New ROMI: 603.9%
Δ ROAS vs baseline: 0.07x
Δ ROMI vs baseline: 7.33 percentage points

===== AFTER UPLIFT + REALLOCATION (Dogs → Question Marks) =====
Shifted 20.0% of Dogs cost → Question Marks
Shift amount: $25,886.18
Final ROAS: 7.04x
Final ROMI: 603.5%
Δ ROAS vs baseline: 0.07x
Δ ROMI vs baseline: 6.97 percentage points

Conclusion

Focusing on high-performing segments and reallocating budget toward the strongest day-device combinations yields the highest incremental ROMI, identifying clear levers for both efficiency gains and growth.

Works Cited

  • Velocity PPC: https://velocityppc.com/blog/call-to-action-tips/

  • Idea Marketing: https://www.ideamktg.com/blog/google-ads-keywords-expert-guide/#how-to-optimize-your-campaigns-google-ads-keywords